1.  Introduction


Since their introduction in 1988, Cellular Neural Networks (CNNs) have shown vast computing
power, especially for image processing [1,2,3,4]. Many VLSI implementations of CNN
analog neural networks have been proposed in recent years [15,18,23]. These implementations
include OTA-based processing elements [8], discrete-time implementations [5,10],
switched-current signal processing elements [14], and current-mode implementations [13].
Each kind of CNN realization has its own advantages and disadvantages. For example,
the discrete-time CNN can yield "exact" template weights and RC constants, but it takes
more area and consumes more power [11,13].

Early CNN implementations were designed to perform one function in image processing
or classification, such as edge detection [2,3], connected component detection [11], or
hole filling. More recently, programmability, i.e. the ability to change template values
electronically, has been studied in detail [6,7,10,23,24]. Furthermore, in some CNN chips
the activation function is also tunable, e.g. its slope and threshold [16,17].

One common feature of currently available CNN circuits is that the output signals are
the feedback outputs of the cells, and those output values are confined to binary values [1].
Hence, the output image is a black and white image even though the CNN is, by nature,
an analog and continuous signal processing system. The binary output values of the
CNN are the positive or negative threshold of the activation function. Due to this
nonlinear sigmoid function, the feedback output of a cell can converge to either a positive or
negative value under some well studied "stability conditions," such as $A_{ii} > 1$. This
characteristic makes the CNN very attractive for some pattern extraction applications,
such as edge detection and connected element detection, where a binary valued output
image is acceptable. Moreover, the circuit design is relatively easy if the output is just
binary rather than continuous, since the linearity, precision, and offsets of the output
values are not relevant [9,13,15,23] because the exact steady-state value is not important.
It is worth mentioning that the silicon area and power dissipation are greatly reduced as
a result, given the tradeoff between precision and area or power dissipation.

However, in some cases the binary output CNN is not enough. For example, in
order to solve a group of differential equations, to build a real time control system,
or to obtain an output image with multiple gray levels (color levels), a CNN with linear
continuous outputs is required. Since the feedback outputs are limited by thresholds, e.g.
-1 and 1, the state variables have wider dynamic ranges than the feedback outputs
and therefore can be used as continuous outputs. In some theoretical CNN research papers
[25], the state variable (or state output) has already been mentioned as a useful source of
continuous information from the CNN. Some authors also define state variables as the roots
of the differential equations and hence solve differential equations with the CNN.

So far, except for the above fundamental work, no circuits have been
fabricated or designed for the purpose of obtaining continuous state outputs. Although
in some CNN chips the state variables could be detected [13,18], they were not used as
outputs. Some major electrical problems raised by the design of such a CNN chip are
summarized below:

1.  The linearity of each circuit block, including the multipliers (associated with template
values), the linear resistor, and the activation function (piecewise linear sigmoid function),
must be seriously considered. Any distortion in those blocks will contribute to the
non-linear error of the state variable output.

2.  The dynamic range of the state variable is bounded by the linear range of the circuit
blocks or the power supply voltage. In order to get a high precision and resolution
output signal, a wide swing range is needed, but the area and power dissipation of the
chip have to be increased in turn.

This paper addresses two key aspects in the field of Cellular Neural Networks. One
is the development of a monolithic prototype (a 3 x 3 CNN array) that uses the state
node as an external output for gray level processing. The second aspect is the integration
of this IC in a complex system. It is necessary to stress that the state-of-the-art
work in Cellular Neural Networks has concentrated on VLSI implementations without
really addressing the "systems level". While efficient implementations have been reported,
no reports have been presented on the use of these implementations for processing large
complex images. The work presented here introduces a strategy to process large images
using small CNN arrays. The approach, time-multiplexing, is prompted by the need to
simulate hardware models and test hardware implementations of CNNs. For practical-size
applications, due to hardware limitations, it is impossible to have a one-to-one mapping
between the CNN hardware processors and all the pixels in the image involved. This paper
presents a practical solution: the input image is processed block by block, with the number
of pixels in a block being the same as the number of CNN processors in the hardware.

2.1  System  Structure 

The CNN IC consists of a 3 x 3 array with shared input/output pins. Salient features
of this implementation are full template programmability, a programmable RC integration
constant, and an external output at the state node.

Figure  1  presents  a  modular  view  of  the  CNN  IC  along  with  I/O  signals. 

•  $b_{11}, b_{12}, \ldots, b_{33}$ are the pins to set the analog template values of $B_{ij}$, where i,j = 1,2,3.

•  $a_{11}, a_{12}, \ldots, a_{33}$ are the pins to set the analog template values of $A_{ij}$, where i,j =
1,2,3.

•  IO1, IO2, ..., IO9 are the input-output pins of all nine cells. The pin of each cell is
used for setting the boundary conditions, initializing the state, and
providing external input values to the cell, as well as for reading its state output.

•  $d_1$ and $d_2$ are control signals to multiplex each input-output pin for different functions
at different time periods.

•  $V_{bias}$ is the offset bias voltage for the templates, and $V_c$ is a tuning voltage of the
active resistor.

•  5 V, -5 V, 1 V, -1 V are power supplies for the circuit and for the activation function,
respectively.

Two control signals are used as switching signals to multiplex inputs, outputs, and
cell initial conditions so that they can share the same pins. Data lines are shared by
analog inputs, boundary values, initial conditions of cell state variables, and outputs of
state variables. The logic codes and sequence of pin multiplexing are shown in Table 1.

Table 1: The codes and sequence of pin multiplexing operations.

sequence   code   operation
1          11     set boundary values (in)
2          10     set initial conditions (in)
3          01     set input voltages (in)
4          00     extract the steady-state outputs
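
The sequence in Table 1 can be sketched as a small host-side driver. This is only an illustration: the assumption that the two code digits are ordered as (d1, d2), and the callback names `set_control`, `write_pins`, and `read_pins`, are hypothetical and do not come from the chip documentation.

```python
from enum import Enum


class PinMode(Enum):
    """Control codes from Table 1, assumed here to be ordered as (d1, d2)."""
    SET_BOUNDARY = (1, 1)
    SET_INITIAL_STATE = (1, 0)
    SET_INPUT = (0, 1)
    READ_STATE = (0, 0)


# Order of operations given in the "sequence" column of Table 1.
CYCLE = [PinMode.SET_BOUNDARY, PinMode.SET_INITIAL_STATE,
         PinMode.SET_INPUT, PinMode.READ_STATE]


def run_cycle(set_control, write_pins, read_pins, boundary, initial, inputs):
    """Step through one multiplexing cycle on the shared pins.

    set_control, write_pins and read_pins are hypothetical callbacks that
    stand in for whatever drives the control lines and the shared data pins.
    """
    data = {
        PinMode.SET_BOUNDARY: boundary,
        PinMode.SET_INITIAL_STATE: initial,
        PinMode.SET_INPUT: inputs,
    }
    for mode in CYCLE:
        set_control(*mode.value)
        if mode is PinMode.READ_STATE:
            return read_pins()      # last step: pins act as outputs
        write_pins(data[mode])      # first three steps: pins act as inputs
```

Passing recording callbacks makes it easy to check that the control codes are issued in the order 11, 10, 01, 00 before any hardware is attached.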

The pin multiplexing scheme is shown in Figure 2. Capacitors (0.6 pF) are added
to hold the input information when the circuit is switched to the output mode. This
capacitance value is designed to eliminate the feed-through effects of the CMOS switches, and
for the same purpose, all analog switches are minimum-size transistors. As a result, the
output is kept unchanged when the pin is switched from input mode to output mode.
The terminal to set the initial condition is connected to the state variable node of the cell.

2.2  Cell  Core 

Figure 3 presents the hardware realization of a CNN cell. Here the RC integrator is
composed of an opamp and an OTA in the feedback path. The OTA acts
as an active resistor with a conductance $g_x = 1/R_x$. The purpose of adding the opamp is to
isolate the RC integrator from the 19 multipliers used to implement the weights of both the A
and B templates.

Observe that when the 19 multipliers are connected in parallel, the net output resistance
is much smaller than that of a single multiplier (divided by 19), and the net parasitic
capacitance is much larger (multiplied by 19). These two non-ideal
elements could reduce the value of $R_x$ and increase the value of $C_x$ in the structure
of Figure 3, because they appear in parallel with them. However, the virtual
ground point (non-inverting input) of the opamp isolates the output impedances of the
multipliers from $R_x$ and $C_x$. In addition, the virtual ground presents each multiplier
with a virtually zero load, and thus eliminates the loading effect that
comes from the finite output impedance of the transconductance multiplier.

Another advantage of isolating the large aggregated parasitic capacitance of the 19
multipliers is that the value of $C_x$ can be controlled by a single capacitor, rather than by
many parasitic values. In this way, it is possible to control the time constant of the RC
integrator. More importantly, the mismatches of the RC time constants of the cells are
much smaller if the value of $C_x$ is set by one monolithic capacitor rather than by many
distributed parasitic values.

To prevent the RC integrator from oscillating, the value of $C_x$ has to
be large enough to form a dominant pole, and thus compensate for the phase shift of the
integrator. Assuming the opamp is ideal and $C_x$ is absent, the transfer function
of the integrator can be expressed as:

$$\frac{V_x(s)}{V_{in}(s)} = \frac{G(s)}{g_x (1 - p_1 s)(1 - p_2 s)}, \qquad (1)$$

where $G(s)$ indicates the transconductance of a multiplier (to simplify the presentation,
assume just one multiplier); $g_x$ is a constant; $p_1$ and $p_2$ are two zeros of the OTA, but here
they become poles. In Equation 1, the OTA is simplified as a 2nd order system. The two
poles, $p_1$ and $p_2$, can cause oscillations depending on the total phase
shift. The most convenient way to stabilize this transfer function is to choose an adequate
value of $C_x$, and thus compensate the phase shift. After adding $C_x$, Equation (1) is written
as:

$$\frac{V_x(s)}{V_{in}(s)} = \frac{G(s)}{g_x \left[ (1 - p_1 s)(1 - p_2 s) + C_x s / g_x \right]}. \qquad (2)$$

In Equation 2, if $C_x/g_x \gg p_1, p_2$, there will be a dominant pole determined by the ratio
$C_x/g_x$ that makes the integrator stable.
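
The dominant-pole argument can be checked numerically with a minimal sketch. It takes the bracketed denominator of Equation 2 literally and assumes, for illustration only, that $p_1$ and $p_2$ are positive time constants; all numerical values below are hypothetical and are not design values from the chip.

```python
import cmath


def integrator_poles(p1, p2, cx, gx):
    """Roots of (1 - p1*s)(1 - p2*s) + (cx/gx)*s = 0 (bracket of Equation 2).

    Expanded form: p1*p2*s^2 + (cx/gx - p1 - p2)*s + 1 = 0.
    Setting cx = 0 gives the denominator of Equation 1.
    """
    tau = cx / gx
    a = p1 * p2
    b = tau - (p1 + p2)
    disc = cmath.sqrt(b * b - 4.0 * a)
    return ((-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a))


# Hypothetical values, chosen only to illustrate the trend.
P1, P2 = 1e-8, 2e-9      # assumed OTA time constants, in seconds
GX = 5e-6                # assumed OTA conductance, in siemens
CX = 1e-12               # assumed integrating capacitor, in farads

poles_without_cx = integrator_poles(P1, P2, 0.0, GX)
poles_with_cx = integrator_poles(P1, P2, CX, GX)

# The dominant pole is the one closest to the imaginary axis.
dominant = max(poles_with_cx, key=lambda s: s.real)
```

Under these assumptions both roots lie in the right half plane when $C_x = 0$, while with $C_x/g_x$ well above $p_1 + p_2$ both move to the left half plane and the dominant one sits near $-g_x/C_x$.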
Finally, the buffer in Figure 3 is basically an opamp with a unity-gain feedback
connection whose function is to isolate the state variable node of the cell from the outside
environment.




2.3  Multiplier  Circuit 


For the design of a continuous output CNN, the general requirements for the
transconductance multiplier are linearity and tolerance to process mismatches. After
evaluating several analog multipliers, the best choice was found to be the Gilbert
transconductance structure, which has been extensively used in the design of analog neural networks.
The Gilbert multiplier has the advantage that its linearity is relatively insensitive to
process mismatches because of its symmetric structure and its differential input and differential
output. The price of these advantages is a larger layout area and
higher power dissipation to get satisfactory results.

The multiplier used here is a folded Gilbert multiplier with linearity enhancement,
which is illustrated in Figure 4. PMOS transistors M1 and M2 form a current mirror pair
to supply a biasing current to the input pair M3 and M4, so that each branch carries half
of the bias current, $I_b/2$. M3 and M4
are biased in their linear regions as source degeneration resistors, whose function is to
expand the linear range of the input pair M5 and M6. $V_c$ is a control voltage that represents
the template value. In the following equations, M3 and M4 are ignored for convenience of
the mathematical analysis of the multiplier principle.

It can be proven that the output differential current of this multiplier is

$$I_{o1} - I_{o2} \approx k' V_{in} \left( \sqrt{2 I_9 / k'} - \sqrt{2 I_{10} / k'} \right)$$

$$= k' V_{in} \sqrt{k / k'} \, (V_{gs5} - V_{gs6} + V_{thp} - V_{thp})$$

$$= \sqrt{k k'} \, V_{in} V_c, \qquad (3)$$

where $k = k_5 = k_6$ and $k' = k_{11} = k_{12} = k_{13} = k_{14}$.

Equation 3 is the fundamental equation to perform four-quadrant multiplications with
the input voltage $V_{in}$ and the control voltage $V_c$. The linearity of the circuit can be
improved by using either long channel transistors or large biasing currents, since the value
of $\sqrt{2 I_{10} / k'}$ can then be made much larger than $V_{in}$.
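
The effect of bias current on linearity can be illustrated with a textbook square-law differential pair. This is a sketch under stated assumptions, not the circuit of Figure 4: it assumes the drain-current convention $I_D = (k/2)(V_{gs} - V_{th})^2$, under which the pair steers fully at $V_d = \sqrt{2I/k}$, and the numerical values are hypothetical.

```python
import math


def diff_pair_current(v_d, k, i_tail):
    """Differential output current of an ideal square-law pair.

    Assumes I_D = (k/2) * Vov^2 for each transistor and a tail current
    i_tail. The pair is fully steered once |v_d| reaches sqrt(2*i_tail/k).
    """
    v_max = math.sqrt(2.0 * i_tail / k)
    if abs(v_d) >= v_max:
        return math.copysign(i_tail, v_d)
    return 0.5 * k * v_d * math.sqrt(4.0 * i_tail / k - v_d * v_d)


def linearity_error(v_d, k, i_tail):
    """Relative deviation from the small-signal line v_d * sqrt(k * i_tail)."""
    ideal = v_d * math.sqrt(k * i_tail)
    return abs(ideal - diff_pair_current(v_d, k, i_tail)) / abs(ideal)


# Hypothetical values: a long-channel device (small k) at two bias currents.
K = 2e-6                                        # A/V^2, assumed
err_low_bias = linearity_error(1.0, K, 10e-6)   # tail current 10 uA
err_high_bias = linearity_error(1.0, K, 40e-6)  # tail current 40 uA
```

With these numbers the deviation at a 1 V input drops from roughly 2.5% to well under 1% when the tail current is quadrupled, which is the trend the text describes.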

In order to save area and power dissipation, a summing current mirror (shown in
Figure 5) is used to collect the output currents of all 19 multipliers instead of adding
individual current mirrors for every multiplier at their output stage. This cascode summing
current mirror is needed to minimize output offsets and make the circuit properties
insensitive to process variations. Table 2 shows the designed circuit parameters. HSPICE
simulations showed a total harmonic distortion (THD) of 0.5% at $V_{in}$ = 1.0 V and $V_c$ = 1.0 V,
and a power dissipation of about 0.75 mW using the process parameters of the 2 µm n-well
CMOS technology from MOSIS.

Table 2: Design parameters of the Gilbert multiplier in Figure 4 ($I_b$ = 10 µA).

transistor                W/L (µm/µm)
M1 & M2                   4/5
M9 & M10                  9/4
M3 & M4                   3/7
M11, M12, M13, & M14      3/26
M5 & M6                   3/22
M7 & M8                   8/4

2.4  The  Linear  Tunable  OTA 

The linear resistor of the cell core was implemented with a tunable OTA. The tunability
of the OTA is indispensable because 1) on-chip tuning is required to compensate
systematic errors caused by parameter mismatches, and 2) CNN programmability requires
that the value of $R_x$ also be adjustable. The traditional method of tuning the
transconductance of the OTA is to adjust the biasing current of the input differential pair. However,
decreasing the biasing current significantly reduces the linearity, because of the quadratic
relationship between the gate-source voltage and the drain current. Hence, this method
is not suitable for our purposes. A more adequate way to simultaneously improve both
the linearity and the tunability of the OTA is to use a programmable current mirror [20,21],
which adjusts the gain of the current mirrors rather than the biasing current of the input
pair. In our case, a modified programmable current mirror (programmable Widlar current
mirror) is presented. It has a simple structure and performs well.

2.4.1  Circuit  Analysis 

The circuit structure of the OTA is shown in Figure 6. The only difference between this
circuit and a basic CMOS OTA is the addition of two transistors, MR1 and MR2, which
are biased in their linear regions. The function of these active resistors is to tune the
current gains of the two current mirrors, M3-M5 and M4-M6, as well as to increase the linear
range.

Let the linear resistance of MR1 and MR2 be denoted as $R$, let $k = k_1 = k_2$ and $k' = k_5 =
k_6$, and let $V_d = V_{in+} - V_{in-}$.

It  is  reasonable  to  assume  the  values  of  k  and  k'  to  be  larger  than  10“®  /A,  R  larger 

than 100 kΩ, and the biasing current $I_b$ large enough (e.g., 35 µA). Then within a limited
input range such that $I_3$ and $I_4$ are not far from $I_b/2$, we have

$$4R\sqrt{k' I_3} \gg 1 \qquad (4)$$

$$4R\sqrt{k' I_4} \gg 1 \qquad (5)$$

Therefore, by obtaining $I_5$ and $I_6$ in terms of $I_3$ and $I_4$, respectively, around MR1 and
MR2, the output current can be approximated as

$$I_{out} = I_5 - I_6 = \frac{1}{2R^2 k'} \left[ 2R\sqrt{k k'}\, V_d - \left( 2R^{1/2} (k' I_3)^{1/4} - 2R^{1/2} (k' I_4)^{1/4} \right) \right] \qquad (6)$$
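
As a plausibility check, the approximation in Equation 6 can be compared against an exact solution of a simplified mirror model. Everything in this sketch is an assumption made for illustration: the square-law convention $I_D = k (V_{gs} - V_{th})^2$, a mirror in which the resistor $R$ sits in series with the source of the output transistor, equal gain factors $k'$ for both mirror transistors, and the numerical values.

```python
import math


def pair_currents(v_d, k, i_b):
    """Split the bias current i_b of a square-law input pair (I = k*Vov^2)."""
    s = math.sqrt(2.0 * i_b / k - v_d * v_d)
    a, b = 0.5 * (s + v_d), 0.5 * (s - v_d)   # overdrive voltages
    return k * a * a, k * b * b


def mirror_out(i_in, k_p, r):
    """Solve sqrt(i_in/k_p) = sqrt(i_out/k_p) + i_out*r for i_out."""
    root = math.sqrt(1.0 / k_p + 4.0 * r * math.sqrt(i_in / k_p))
    x = (root - 1.0 / math.sqrt(k_p)) / (2.0 * r)   # x = sqrt(i_out)
    return x * x


def i_out_exact(v_d, k, k_p, r, i_b):
    """Output current of the assumed model without approximation."""
    i3, i4 = pair_currents(v_d, k, i_b)
    return mirror_out(i3, k_p, r) - mirror_out(i4, k_p, r)


def i_out_eq6(v_d, k, k_p, r, i_b):
    """Right-hand side of Equation 6 evaluated for the same model."""
    i3, i4 = pair_currents(v_d, k, i_b)
    quarter = 2.0 * math.sqrt(r) * ((k_p * i3) ** 0.25 - (k_p * i4) ** 0.25)
    return (2.0 * r * math.sqrt(k * k_p) * v_d - quarter) / (2.0 * r * r * k_p)


# Hypothetical values for which 4*R*sqrt(k'*I) is much larger than 1.
K, K_P, R, I_B = 1e-5, 1e-5, 1e6, 35e-6
exact = i_out_exact(0.1, K, K_P, R, I_B)
approx = i_out_eq6(0.1, K, K_P, R, I_B)
```

In this model the two results agree closely once the conditions of Equations 4 and 5 hold, and the exact output is an odd function of $V_d$, consistent with the vanishing of the even-order terms.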

In order to separate the linear and non-linear terms of $I_{out}$, it is better to expand it
into a polynomial in the differential input voltage $V_d$. All even terms
vanish if it is assumed that the input voltage is differential. Disregarding high order terms,
we have

$$I_{out} = a_1 V_d + a_3 V_d^3 + \cdots,$$

where

$$a_1 = \left. \frac{dI_{out}}{dV_d} \right|_{V_d = 0} \qquad (7)$$

$$a_3 = \frac{1}{3!} \left. \frac{d^3 I_{out}}{dV_d^3} \right|_{V_d = 0} \qquad (8)$$


Evaluating  the  corresponding  derivatives,  equations  7  and  8  yield  respectively 


The linear term in Equation 9 is the conductance of the OTA. In certain cases, such
as when $R$ is large enough, $a_1$ can be further simplified as

(11) 


Notice also from Equation 10 that the non-linearity (3rd order distortion)
of $I_{out}$ can be greatly reduced by increasing the values of $R$ and $I_b$. Under the assumptions
made in Equations 4 and 5, the 3rd order term is much smaller than the linear term.

Now recall that since MR1 and MR2 are working in the linear range, their equivalent
resistance is

$$R \approx \frac{1}{k_R (V_{gs} - V_{thp})}$$


Then,  it  can  be  concluded  that 


/o«. «  4j!(I  Vii-V,\-\  Vap  DKi^ 


Another advantage of this programmable current mirror is that its gain bandwidth
product does not change when its conductance is adjusted. This is easy to
understand, since the biasing current of the input pair, $I_b$, is not changed. The design
parameters of the OTA of Figure 6 are listed in Table 3.

Table 3: The design parameters of the OTA with programmable
Widlar current mirror in Figure 6 ($I_b$ = 25 µA).

transistor              W/L (µm/µm)
M1 & M2                 3/35
M3 & M4                 4/4
M5 & M6                 4/16
MR1 & MR2               3/19
M7, M8, M9 & M10        4/16

The OTA's input range is from -3 V to 3 V. The tuning range of the conductance is
from 2 x to 5 x . The simulated THD vs. input voltage is plotted in
Figure 7. Although the THD differs between input voltages, these variations
are within the tolerable range of linearity. The estimated total power dissipation of the
OTA is 0.45 mW.

2.5  The  Activation  Function 

The sigmoid activation function plays a very important role in controlling the errors and
the stability of the CNN [19]. The most important aspects to consider are

1.  The threshold voltages of the sigmoid function must be accurate to avoid
errors in the output voltage.

2.  The slope of the linear segment should be the same among all cells in the CNN. Any
mismatch may introduce stability problems.

3.  The slope of the sigmoid function of each cell should be about 1.0, since this makes the
cell more stable than other slope values.

4.  The slew rate of the output voltage should be high to have a short settling time.

The circuit structure of the sigmoid activation function is shown in Figure 8. It is
basically an opamp in a unity-gain feedback connection, but the power supplies of the first
and second stages of the opamp are different. The supply voltages of the first stage are
-5 and +5 volts, while the voltages of the second stage are only -1 and +1 volts. All
supply voltages are provided from outside the IC. The advantage of using independent power
supplies in one opamp is that the threshold voltages of the sigmoid function are well defined
and programmable. Therefore we do not have to use hard limiter circuits, whose threshold
voltages vary significantly with process variations.

The feedback connection of a high-gain opamp guarantees that the
slope of the input-output characteristic curve is almost the ideal value of 1.0 and that there
are sharp turning corners at the points $V_{in}$ = -1.0 V and $V_{in}$ = 1.0 V.
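
The ideal behavior that this circuit approximates is the standard piecewise linear CNN output function. A minimal sketch follows; the parameter name `v_sat` is hypothetical and stands for the second-stage supply voltage, which the text describes as the programmable threshold.

```python
def activation(v_in, v_sat=1.0):
    """Ideal unity-slope piecewise linear sigmoid, clipped at +/- v_sat volts.

    Equivalent to clamping v_in to the interval [-v_sat, v_sat].
    """
    return 0.5 * (abs(v_in + v_sat) - abs(v_in - v_sat))
```

For inputs inside the linear segment the function returns the input itself; outside it the output saturates at the chosen supply level, so changing `v_sat` models reprogramming the thresholds.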

These conditions make the activation function a nearly perfect piecewise linear sigmoid
function. However, the deep feedback connection of the opamp may introduce stability
problems. A compensation capacitor $C_c$ has to be added to compensate for the phase shift,
but $C_c$ will contribute to the time delay of the activation function. Within one chip, it is
acceptable to assume that the relative mismatch (or the variation of the ratio) of the $C_c$ values
between two cells is very small, so the time delay mismatch caused by $C_c$ is not critical.
The simulated slew rate of the activation function is about 10 V/µs at a load capacitance
of 1 pF. The total power dissipation of the activation function is 0.38 mW. Table 4 lists
the design parameters of the activation function.

Table 4: The design parameters of the activation function circuit ($I_b$ = 15 µA).

transistor              W/L (µm/µm)
M1 & M2                 4/4
M3, M4, M5 & M6         5/4
M7 & M8                 7/4

2.6  Opamp


There are two opamps in each cell: one is in the cell core, and the other is used as a
buffer to isolate the state variable node from the parasitic capacitance of the outside world.
Both opamps have the one-stage structure shown in Figure 9. The advantage of
using a one-stage opamp is that it is more stable than a two-stage opamp, and the load
capacitance does not affect the stability of the opamp. The gain of the one-stage opamp
is not very high, but it is enough for our applications. Its design parameters and some
important HSPICE simulation results are listed in Table 5 and Table 6, respectively.

Table 5: The design parameters of the opamp shown in Figure 9 ($I_b$ = 20 µA).

transistor                W/L (µm/µm)
M1 & M2                   4/4
M3, M4, M5 & M6           5/4
M7, M8, M9 & M10          5/4
M11, M12, M13 & M14       4/7

Table 6: HSPICE simulated results of the one-stage opamp in Figure 9.

parameter                 result
DC gain                   65 dB
dynamic swing range       -3 to +3 V
phase margin              50°
gain bandwidth product    4.0 MHz ($C_L$ = 1 pF)
slew rate                 10 V/µs
power dissipation         0.42 mW

3.  TIME-MULTIPLEXING  HARDWARE  SIMULATION 

In time-multiplexing hardware simulations one can define a block of pixels (subimage)
which will be processed by an equal number of CNN cells [22]. Once convergence is
achieved, a new subimage adjacent to the one just processed is scheduled for
processing. This procedure is repeated until the whole image has been scanned in
lexicographical order, say, from left to right and from top to bottom. With
this approach the processing of large images becomes feasible in spite of the finite
number of CNN cells.

Even though the approach seems simple and appealing, an important observation is
necessary: the processed border pixels in each subimage may have incorrect values since
they are processed without neighboring information. To cope with this
problem, two sufficient conditions must be considered to ensure that each border cell
properly interacts with its neighbors. These conditions are: 1) to have a belt of pixels
from the original image around the subimage being processed, and 2) to have pixel overlaps
between adjacent subimages. We will go into the details of these two constraints in the
next subsection.

3.1  Sufficient  and  Necessary  Conditions  for  Time  Multiplexing 

Notice first that in the absence of template A values the error is both image and
template B dependent. In other words, the steady state of a border cell may converge
to an incorrect value due to the absence of its neighbors' weighted inputs. One can easily
conclude that the error is canceled if the missing external inputs are provided to the border
cells as depicted in Fig. 10a. Since the array is typically "embedded" in the image during
operation, this condition can easily be satisfied.

Let us now address the interactions among cells. The problem in this situation is more
involved because the output signals depend on the state of their corresponding cells. To
minimize the error, an overlap of pixels between two adjacent blocks is proposed (see Fig.
10b). In this way, the inner cells of the CNN array will always receive weighted processing
information from the border cells.

The general time-multiplexing procedure consists of processing each image block until
all CNN cells within the block converge. The state output variables of the converged
cells are the values used for the final output image. Every time a
new subimage is processed, the physical CNN array is initialized to the initial conditions
of the original image, or to black or to white, as required by the template in use. In the
overlapping procedure the converged values of the outer overlapped cells are discarded since they
were computed with incomplete neighboring information. Only the converged values of the
inner cells are kept as valid. This implies that for a neighborhood radius of 1, an overlap
of two pixel columns/rows is needed to ensure correct values for pixels assigned
to the border cells. With the added overlapping feature, better neighboring interactions
are achieved, but at the same time an increase in computation time is inevitable.

With the previous multiplexing scheme the image needs to be iterated several times
over newly obtained states to allow the proper propagation of global effects. Multiple
iterations are necessary to guarantee that all cells have converged to correct values taking
into account all global effects. This can be inferred by considering a diagonal propagation
of, say, a black pixel in a fully white image. Notice that without overlaps it is impossible
to propagate global effects, and that the propagation is achieved with at least one
overlapped pixel.

For the purpose of better understanding the overall idea of this time-multiplexing
approach, a simplified algorithm is presented below. Assume an M by N image, an m by
n CNN array, pixel values $E_{ij}$ and o overlaps.


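for (i = 1; i < M; i = i + m - o) {
    for (j = 1; j < N; j = j + n - o) {
        /* load the inputs of the current block from the image */
        u(i, j) = E(i, j)
        /* initial state: original image, white, or black, as required */
        x(i, j)(t_0) = E(i, j) | white | black
        /* integrate the cell dynamics of the block until convergence */
        x(i+m, j+n)(t_{n+1}) = x(i+m, j+n)(t_n) + integral of f(x(t_n)) dt
    }
}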
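
The block scheduling of the algorithm above can be sketched in runnable form. This is an illustration only: `process_block` is a hypothetical stand-in for the converged response of the CNN array, the overlap `o` is assumed to be even and split equally between the two sides of a block, and the belt of original-image pixels around each block is omitted for brevity.

```python
def block_origins(M, N, m, n, o):
    """Top-left corners (0-based) of m-by-n blocks stepped by m - o and n - o."""
    if not 0 <= o < min(m, n):
        raise ValueError("overlap must satisfy 0 <= o < min(m, n)")
    return [(i, j)
            for i in range(0, M - m + 1, m - o)
            for j in range(0, N - n + 1, n - o)]


def time_multiplex(image, m, n, o, process_block):
    """Scan the image block by block and keep only the inner results.

    The outer o // 2 rows and columns of every processed block are
    discarded, because they were computed without full neighborhoods.
    Pixels never covered by an inner region keep their input values.
    """
    M, N = len(image), len(image[0])
    out = [row[:] for row in image]
    margin = o // 2
    for i, j in block_origins(M, N, m, n, o):
        block = [row[j:j + n] for row in image[i:i + m]]
        result = process_block(block)
        for r in range(margin, m - margin):
            for c in range(margin, n - margin):
                out[i + r][j + c] = result[r][c]
    return out


def block_mean(block):
    """Placeholder for the CNN: every cell returns the block average."""
    total = sum(sum(row) for row in block)
    count = len(block) * len(block[0])
    return [[total / count] * len(block[0]) for _ in block]


demo_image = [[5 * r + c for c in range(5)] for r in range(5)]
demo_output = time_multiplex(demo_image, 3, 3, 2, block_mean)
```

With a 3 x 3 array and an overlap of two, the step is one pixel and only the center cell of each block contributes to the output, so a 5 x 5 image needs nine block evaluations.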

4.1  Testing of Electrical Parameters


The 3 x 3 CNN chip was fabricated with the MOSIS n-well 2.0 µm process. The photograph
of the die is shown in Figure 11a, where all cells are arranged as a 3 by 3 array. The die
area of the circuit is approximately 3.2 mm², and the power dissipation is less than 250
mW. The photograph of one cell is in Figure 11b; its layout area is 0.19 mm².

Two important circuit building blocks, the analog multiplier and the tunable linear
OTA, were also fabricated in separate chips for testing purposes. The DC sweep curves of
the OTA at different values of $V_c$ are shown in Figure 12. The DC sweep characteristics
of the multiplier are shown in Figure 13.

The linearity of the central cell C(2,2) can be calibrated by adjusting the $g_m$ of the
tunable OTA using the procedure previously described. It is necessary here to deduct an
amount of -0.51 V from $V_{bias}$, since this value is used to cancel the output offset of C(2,2)
and cannot be counted in the calculation of linearity.

The CNN chip is connected to a personal computer (PC) through an A/D and a
D/A interfacing board. The system connections are described in Figure 14. The
operations of setting inputs and getting outputs from the CNN chip are multiplexed externally
by 4-to-1 analog multiplexer chips (ADG509A). External operations are synchronized with
the multiplexing operations inside the chip. The A/D board was an AT-MIO-16F,
which has 12 A/D channels; the D/A board was an AT-AO-6/10, which has 10 D/A
channels. Both are products of National Instruments. The pin multiplexing control code
is generated by a computer program and interfaced through the digital I/O port of the
AT-MIO-16F board.

The  detailed  connections  of  the  analog  multiplexers  are  shown  in  Figure  15,  where 
the  opamps  are  added  as  A/D  output  buffers  to  isolate  the  output  node  from  the  parasitic 
capacitance  of  the  wires  and  the  A/D  board. 

The operating sequence is as follows:

1.  Initialize the A/D and D/A boards. Set the required template values by providing
the corresponding analog voltages.

2.  Set  the  boundary  values  of  the  3  x  3  CNN,  and  the  initial  values  of  all  the  CNN  cells. 

3.  Map the pixel values (0 to 255) of the input image into CNN input voltage values
(1.0 V to -1.0 V), and send them to the chip.

4.  Extract  the  steady-state  values  (3.0V  to  -3.0V)  of  the  state  variables  of  all  cells  and 
map  them  to  pixel  values  (0-255)  of  the  output  image. 

5.  For time multiplexing applications [22], move to another position in the input image
and repeat from step 2.
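
Steps 3 and 4 amount to two linear maps between pixel values and voltages. The following is a minimal sketch, assuming the maps are linear and follow the order in which the ranges are listed (pixel 0 corresponds to +1.0 V at the input and to +3.0 V at the state output; pixel 255 corresponds to -1.0 V and -3.0 V). The function names are hypothetical, and state voltages outside the listed range are simply clipped.

```python
def pixel_to_input_voltage(pixel):
    """Map a pixel value in 0..255 linearly onto +1.0 V .. -1.0 V (assumed linear)."""
    if not 0 <= pixel <= 255:
        raise ValueError("pixel must be in 0..255")
    return 1.0 - 2.0 * pixel / 255.0


def state_voltage_to_pixel(voltage):
    """Map a steady-state voltage in +3.0 V .. -3.0 V linearly onto 0..255."""
    v = max(-3.0, min(3.0, voltage))  # clip to the assumed state range
    return int(round((3.0 - v) * 255.0 / 6.0))
```

If the hardware uses the opposite polarity, only the sign of the slope in each function changes.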

Another important function of tuning gm is to expand the adjustable range of template
values. For example, reducing gm increases the resulting template value (= G/gm x hij).

4.2  Image  Processing  Applications 

The following are several examples of image processing applications using this
3x3 CNN chip, with output pixels taken at the state outputs. The input and output
images are 256 x 256 pixels. The number of gray levels is 255. We chose
a gray level image and a color image to demonstrate the potential of the CNN
state output nodes; see Figs. 16a and 16b.

The image processing is realized by time multiplexing. At each step
the CNN chip processes only a 3 x 3 pixel array of the image, and the border cells of the
CNN overlap between neighboring arrays so that the boundary dynamics are correct.
Therefore, only cell C(2,2) gives the output pixel value.
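
The scan described above can be sketched in software as a 3 x 3 window that moves one pixel at a time, keeping only the central value of each run. This is an illustration under stated assumptions, not the control program used in the experiments: `time_multiplex` and `process_patch` are hypothetical names, `process_patch` stands in for one complete run of the chip (steps 2 to 4) returning the value of C(2,2), and pixels outside the image are given a fixed boundary value.

```python
def time_multiplex(image, process_patch, boundary=0.0):
    """Slide a 3x3 window over `image` (a list of rows), one pixel per step.

    `process_patch` is a stand-in for one run of the 3x3 chip; it receives
    a 3x3 patch and returns the value of the central cell C(2,2).
    """
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        # Positions outside the image take the assumed boundary value.
        if 0 <= r < rows and 0 <= c < cols:
            return image[r][c]
        return boundary

    output = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            patch = [[pixel(r + dr, c + dc) for dc in (-1, 0, 1)]
                     for dr in (-1, 0, 1)]
            out_row.append(process_patch(patch))
        output.append(out_row)
    return output
```

With this scheme a 256 x 256 image requires one chip run per pixel, which is the price paid for using a small array.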

4.2.1  Edge  Detection 

The  edge  detection  templates  are  the  following  [2]: 


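\[
B = \begin{bmatrix} 0 & -0.48 & 0 \\ -0.48 & 2.0 & -0.48 \\ 0 & -0.48 & 0 \end{bmatrix},
\qquad
A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 2.0 & 0 \\ 0 & 0 & 0 \end{bmatrix},
\qquad
V_{bias} = -0.15\ \mathrm{V}
\]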


The processed black and white output image for the template with A(i,j;i,j) > 1/Rx
is displayed in Figure 17a. The color results are shown in Fig. 17b. In our experiments,
the value of Vbias strongly affects the results. Several adjustments of Vbias may be
needed to obtain a good edge.
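
The behavior of the edge detection template can be illustrated with a small numerical sketch. It assumes the textbook normalized CNN cell equation dx/dt = -x + A(i,j;i,j) y(x) + sum(B u) + Vbias with Rx = C = 1, the piecewise-linear output y = 0.5(|x + 1| - |x - 1|), inputs in the range -1 to +1, and a zero initial state; it does not model the fabricated circuit or its voltage scaling. Because A has only a center entry, the cells are decoupled and only the central cell needs to be integrated. All names are hypothetical.

```python
B = [[0.0, -0.48, 0.0],
     [-0.48, 2.0, -0.48],
     [0.0, -0.48, 0.0]]
A_CENTER = 2.0   # the only non-zero entry of the A template
V_BIAS = -0.15   # bias value quoted with the templates


def output(x):
    """Piecewise-linear saturation y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def center_cell_steady_state(u, x0=0.0, dt=0.01, steps=3000):
    """Forward-Euler integration of the normalized state equation of C(2,2).

    `u` is a 3x3 patch of input values in [-1, 1]; returns the final state x.
    """
    drive = sum(B[i][j] * u[i][j] for i in range(3) for j in range(3)) + V_BIAS
    x = x0
    for _ in range(steps):
        x += dt * (-x + A_CENTER * output(x) + drive)
    return x
```

Under these assumptions a uniform patch settles with y = -1, while a patch whose center is at +1 and whose neighbors are at -1 settles with y = +1. Changing `V_BIAS` shifts the point at which the cell switches, which is consistent with the sensitivity to Vbias noted above.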

4.2.2  Hole  Filling 

The hole filling function can be used in contrasting operations and noise removal. This
function is realized by the mutual feedback of output pixel values.

B  A 


Vbias = -0.85 V

The corresponding output images are in Figures 18a and 18b.

5.  Conclusions

This paper demonstrated the feasibility of processing large images using small CNN
arrays. For images of practical size, current technological limitations make it
impossible to have a one-to-one mapping between the CNN hardware
cells and all the pixels of the image. The proper use of time multiplexing is
therefore a key issue in everyday applications. Additionally, it was shown
that a state-node output approach is especially suitable for color image processing. This
approach is not limited by the constraint An < 1 in order to obtain analog values.

Figure 3. The building structure of the CNN cell. “M” means a transconductance
multiplier with a current output.




Figure 4. Circuit of the folded Gilbert multiplier with linear expansion. Vc is the control
voltage (template input), and Vin is the image pixel input.


Figure  5.  The  summing  current  mirror  of  all  19  multipliers. 

Figure 6. The circuit of the tunable linear OTA using programmable current mirrors.
Vin is the input voltage from -3 V to +3 V; Vt is the tuning signal of the OTA
conductance.

Figure 7. Plots of THD vs. input voltages (peak-to-peak value) of the “basic” OTA and
the tunable OTA at Vt = -2 V and Vt = 2 V.


Figure  11a.  The  photograph  of  the  die  of  the  3x3  CNN. 

Figure 14. The general interfacing system of the CNN chip with a personal computer.

Figure 15. The detailed connections of the 4-1 analog multiplexers outside the CNN chip.


Figure 18a. The output images after applying the hole filling template (gray level image).